# Multimodal large models
Heron NVILA Lite 33B
Apache-2.0
Heron-NVILA-Lite-33B is a vision-language model built on the NVILA-Lite architecture and trained specifically for Japanese. It supports multimodal tasks in both Japanese and English.
Image-to-Text Supports Multiple Languages
turing-motors
99
3
Internvl3 2B Hf
Other
InternVL3-2B is a multimodal large language model implemented with the Hugging Face Transformers library. It handles multimodal tasks involving images, video, and text, and supports multiple input methods and efficient batch inference.
Image-to-Text
Transformers Other

OpenGVLab
41.22k
2
Qari OCR 0.3 SNAPSHOT VL 2B Instruct Merged
A vision-language model designed specifically for Arabic optical character recognition (OCR), capable of directly recognizing Arabic text in images.
Image-to-Text
Transformers

NAMAA-Space
467
0
Internlm Xcomposer2d5 Ol 7b
Other
InternLM-XComposer2.5-OL is a comprehensive multimodal system supporting long-term streaming video and audio interaction.
Text-to-Image
internlm
79
49
Xgen Mm Phi3 Mini Base R V1
Apache-2.0
XGen-MM is the latest series of large multimodal models developed by Salesforce AI Research. It builds on the design of BLIP and incorporates fundamental enhancements to the model architecture.
Image-to-Text
Transformers English

Salesforce
240
18
Internlm Xcomposer2 Vl 1 8b
Other
A large vision-language model based on InternLM2, with strong image-text understanding and creation capabilities.
Text-to-Image
Transformers

internlm
169
18
Featured Recommended AI Models